1 Introduction

Now that we gained an in-depth understanding of BLAST-associated E-values and the BLAST output in general, we will use the latter to continue our analysis. In order to create a phylogenetic tree from the BLAST results, a multiple sequence alignment (MSA) must be performed first. The output generated by the MSA is an alignment file, which is needed to calculate the distances for the creation of the phylogenetic tree. In addition, an MSA is great intermediate step to visualize the BLAST output and get a better understanding of the data set.

2 Loading the database and BLAST output

Because the MSA will be carried out in R, we need to read in the text file query_seq_blast_custom.blast generated by the BLAST first. Due to the programs parameters, our BLAST result contains 594 alignments. Performing a MSA on the whole set is impractical - the computers we are using would take multiple days to do the calculations - and unnecessary, given our research question. For this reason, we reduce the sample to the top 20 hits by E-value for further analyses. This cut-off is chosen arbitrarily, but can be adjusted at any time. Do we want to analyse a known subset of orthologs or are we looking for any homologs to gain more information about a previously unknown gene? These types of analyses are often performed iteratively, starting rather broad and narrowing the scope to further investigate interesting findings in later iterations.

library(dplyr)

# specify and set working directory to blast folder
working_dir <- '~/Kurse/Genomanalyse_und_Phylogenie_2025-2026/Praktikum/'
setwd(working_dir)

# specify BLAST output file, read in and set column names
blast_out_file <- 'MSA_files/query_seq_blast_custom.blast'
blast_out <- read.table(blast_out_file, header = FALSE, sep = '\t', quote = '#', skip=5, blank.lines.skip = TRUE)
colnames(blast_out) <- c('subject_acc', 'alignment_length', '%identity', 'gaps', 'score', 'bit_score', 'evalue')

# number of sequences before reduction
nrow(blast_out)
## [1] 594
# reduce BLAST output to top 20 hits by E-value
blast_top_20 <- blast_out %>% slice_min(evalue, n=20)

# print names of top 20 hits
blast_top_20$subject_acc
##  [1] "sp|P38398|BRCA1_HUMAN" "sp|Q9GKK8|BRCA1_PANTR" "sp|Q6J6I8|BRCA1_GORGO"
##  [4] "sp|Q6J6J0|BRCA1_PONPY" "sp|Q6J6I9|BRCA1_MACMU" "sp|Q95153|BRCA1_CANLF"
##  [7] "sp|Q864U1|BRCA1_BOVIN" "sp|O54952|BRCA1_RAT"   "sp|P48754|BRCA1_MOUSE"
## [10] "sp|F4I443|BARD1_ARATH" "sp|Q8RXD4|BRCA1_ARATH" "sp|Q9BZY9|TRI31_HUMAN"
## [13] "sp|B6VQ60|BRCA1_CAEEL" "sp|Q14527|HLTF_HUMAN"  "sp|Q5IFK1|MCPH1_MACFA"
## [16] "sp|Q99728|BARD1_HUMAN" "sp|P61590|MCPH1_COLGU" "sp|O77666|TRI26_PIG"  
## [19] "sp|Q3UWZ0|TRI75_MOUSE" "sp|Q9P3U8|YJ95_SCHPO"

After loading and reducing the set of sequences, we now need to import the actual amino acid sequences from the Uniprot database. The names imported from the database contain a description, that we do not need right now. Removing this string will improve the aesthetics of the MSA plot later.

library(Biostrings)
library(stringr)

# specify uniprot db file and read in
uniprot_db_file <- 'MSA_files/uniprot_sprot.fasta'
uniprot_db <- readAAStringSet(uniprot_db_file)

# print long name from database
uniprot_db[1]@ranges@NAMES
## [1] "sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1"
# remove descriptions from names
uniprot_db@ranges@NAMES <- str_split_i(uniprot_db@ranges@NAMES,' ', i=1)

# reduce db to only include top 20 BLAST hits
blast_mask <- sapply(blast_top_20$subject_acc, function(X) which(names(uniprot_db)==X))
msa_input <- uniprot_db[blast_mask]

# show msa_input
msa_input
## AAStringSet object of length 20:
##      width seq                                              names               
##  [1]  1863 MDLSALRVEEVQNVINAMQKILE...VALYQCQELDTYLIPQIPHSHY sp|P38398|BRCA1_H...
##  [2]  1863 MDLSALRVEEVQNVINAMQKILE...VALYQCQELDTYLIPQIPHSHY sp|Q9GKK8|BRCA1_P...
##  [3]  1863 MDLSALRVEEVQNVINAMQKILE...VALYQCQELDTYLIPQIPHSHY sp|Q6J6I8|BRCA1_G...
##  [4]  1863 MDLSAVRVEEVQNVINAMQKILE...VALYQCQELDTYLIPQIPHSHY sp|Q6J6J0|BRCA1_P...
##  [5]  1863 MDLSAVRVEEVQNVINAMQKILE...VALYQCQELDTYLIPQIPHSHY sp|Q6J6I9|BRCA1_M...
##  ...   ... ...
## [16]   777 MPDNRQPRNRQPRIRSGNEPRSA...WKAPSSWFIDCVMSFELLPLDS sp|Q99728|BARD1_H...
## [17]   841 MAAPILKDVVAYVEVWSSNGTEN...KWVLDSITQHKVCASENYLLPQ sp|P61590|MCPH1_C...
## [18]   545 MATSAPLRSLEEEVTCSICLDYL...FTRRLLPFLWLRWPGTRLLLRP sp|O77666|TRI26_PIG
## [19]   467 MAHVEVLARLQKETKCPICLDDL...PYFYIGPQSEPLRLCSATDSEC sp|Q3UWZ0|TRI75_M...
## [20]   468 MVKRSSHRQVVLDEDDEENYNNN...ASYPRRQARRTRTIQLDSDEES sp|Q9P3U8|YJ95_SCHPO

3 Multiple Sequence Alignment (MSA)

We are now ready to perform the multiple sequence alignment. The bioconductor msa package comes with three different alignment algorithms, namely ClustalW (v 2.1), ClustalOmega and MUSCLE. We will just use ClustalW (v 2.1) without further explanations on what differentiates it from the other two. Just keep in mind that different algorithms will produce different outcomes and results should always be reported alongside a mention of the used algorithm.

library(msa)

# call multiple sequence alignment function and specify algorithm
ClustalWAlignment <- msa(msa_input, "ClustalW")
## use default substitution matrix
# show excerpt from results
ClustalWAlignment
## CLUSTAL 2.1  
## 
## Call:
##    msa(msa_input, "ClustalW")
## 
## MsaAAMultipleAlignment with 20 rows and 1963 columns
##      aln                                                   names
##  [1] -------------------------...------------------------- sp|F4I443|BARD1_A...
##  [2] -------------------------...------------------------- sp|Q8RXD4|BRCA1_A...
##  [3] -------------------------...HY----------------------- sp|Q9GKK8|BRCA1_P...
##  [4] -------------------------...HY----------------------- sp|Q6J6I8|BRCA1_G...
##  [5] -------------------------...HY----------------------- sp|P38398|BRCA1_H...
##  [6] -------------------------...HY----------------------- sp|Q6J6J0|BRCA1_P...
##  [7] -------------------------...HY----------------------- sp|Q6J6I9|BRCA1_M...
##  [8] -------------------------...AADSSQPCV---------------- sp|Q95153|BRCA1_C...
##  [9] -------------------------...------------------------- sp|Q864U1|BRCA1_B... 
##  ... ...
## [13] -------------------------...------------------------- sp|P61590|MCPH1_C...
## [14] -------------------------...LQD---------------------- sp|B6VQ60|BRCA1_C...
## [15] MPDNRQPRNRQPRIRSGNEPRSAPA...------------------------- sp|Q99728|BARD1_H...
## [16] -------------------------...------------------------- sp|Q9BZY9|TRI31_H...
## [17] -------------------------...------------------------- sp|O77666|TRI26_PIG
## [18] -------------------------...------------------------- sp|Q3UWZ0|TRI75_M...
## [19] -------------------------...------------------------- sp|Q9P3U8|YJ95_SCHPO
## [20] --------------------MSWMF...FGTKKPNADEMKQAKINEIRTLIDL sp|Q14527|HLTF_HUMAN
##  Con -------------------------...------------------------- Consensus

We will later use this alignment to calculate the distances for the phylogenetic tree. For this we need to safe the alignment results in FASTA format, with which we are familiar at this point. The file will be created in your current working directory.

# write alignment results to fasta-file
msaPrettyPrint(ClustalWAlignment, output="asis", alFile="MSA_files/Human_BCR1_msa.fasta", askForOverwrite=FALSE)

4 Visualizing the MSA

At this point, we have everything to continue with the generation of the phylogenetic tree. But while we are at it, we can visualize the msa output. We will use the ggmsa package to create plots for the alignment. This package is also part of the bioconductor family, but sadly does not work with the data type created by the msa() function. So we need to convert the data object to ensure compatibility. Lets start with visualizing the alignment for the first 100 amino acids.

library(ggplot2)
library(ggmsa)

# convert MsaAAMultipleAlignment to AAStringSet to ensure compatibility with ggmsa
aln_unma <- unmasked(ClustalWAlignment)

# plot first 100 positions of alignment. To understand all the flags used run ?ggmsa in the console
msa_plt_100 <- ggmsa(aln_unma, start=1, end=100, font="DroidSansMono", color="LETTER",
                     char_width=0.5, by_conservation=TRUE, seq_name=TRUE) +
  scale_x_continuous(breaks = c(1, seq(10,100,by=10))) +
  geom_seqlogo(font="DroidSansMono", color="Chemistry_AA", top=TRUE, adaptive=TRUE) + geom_msaBar()
# save plot to png-file
ggsave(paste0('MSA_files/Human_BCR1_msa_top100.png'),msa_plt_100, width =15, height=5, limitsize=FALSE)

The names of the aligned sequences are shown on the left side of the graph. Each name is followed by the alignment result for the sequence. Because we chose to color the sequences by LETTER, every amino acid is colored differently. The sequence logo on top of the alignment shows the most abundant occurrences at a given position. Higher abundant characters are placed towards the top. These characters are adaptive (adaptive=TRUE), meaning that letter size is scaled with abundance. The color scheme for the logo was set to Chemistry_AA, this means that amino acids with similar chemical properties have similar colors. You can see this for example with Glycine (G), Leucine (L), Proline (P) and Methionine (M) all being colored orange. The bottom of the plot shows the sequence consensus, meaning the prevalent character at a sequence position. The bars show how well the data supports the consensus, the higher and darker the bar, the better the conservation of a certain position. The letters below the bars denote the consensus sequence.

Creating the full alignment plots takes a lot longer as we have a very long alignment. R is also unable to handle more than 1050 sequence positions in one plot, so we need to split it accordingly. As the whole alignment consist of 1966 positions (see output form alignment above) we can split the plot evenly down the middle.

# plot and save first half of alingment
msa_plt_1 <- ggmsa(aln_unma, start=1, end=983, font="DroidSansMono", color = "LETTER",
                   char_width = 0.9, by_conservation=TRUE, seq_name = TRUE) +
  geom_seqlogo(font="DroidSansMono", color = "Chemistry_AA", top=TRUE, adaptive=TRUE) + geom_msaBar()

ggsave(paste0('MSA_files/Human_BCR1_msa_1_983.png'),msa_plt_1, width =100, height=5, limitsize=FALSE)

# plot and save second half of alingment
msa_plt_2 <- ggmsa(aln_unma, start=984, end=1966, font="DroidSansMono", color = "LETTER",
                   char_width = 0.9, by_conservation=TRUE, seq_name = TRUE) +
  geom_seqlogo(font="DroidSansMono", color = "Chemistry_AA", top=TRUE, adaptive=TRUE) + geom_msaBar()

ggsave(paste0('MSA_files/Human_BCR1_msa_984_1966.png'),msa_plt_2, width =100, height=5, limitsize=FALSE)

MSA visualization for the first 100 alignment positions MSA visualization for the first 100 alignment positions

The plots are very long and crowded, so it is hard to make out specific details. One observation that can be made from the plots, is that there are seemingly two groups. The first group consists of way longer proteins compared to the second group. Regions where both groups are aligned show some signs of being conserved. So it could make sense to either only look at these regions or analyse the two groups separately. We should come back later and check if and how the two groups observed here are also represented in the phylogenetic tree.